Abstract. - The most costly and annoying characteristic of the e-mail communication system
is the large number of unsolicited commercial e-mails, known as spams, that are continuously 
received. Via the investigation of the statistical properties of the spam delivering intertimes, we 
show that spams delivered to a given recipient are time correlated: if the intertime between two 
consecutive spams is small (large), then the next spam will most probably arrive after a small 
(large) intertime. Spam temporal correlations are reproduced by a numerical model based on the 
random superposition of spam sequences, each one described by the Omori law. This and other 
experimental findings suggest that statistical approaches may be used to infer how spammers 
operate. 



Introduction. — Quoting from Ref. [1], a press re- 
lease of the European Union: "The proliferation of unso- 
licited commercial e-mail, or 'spam', has reached a point 
where it creates a major problem for the development of
e-commerce and the Information Society. Businesses and
individuals spend an increasing amount of time and money 
simply to clean up e-mailboxes. The loss in productivity 
for EU businesses has been estimated at 2.5 billion € for 
2002. [...] Spam has the potential of destroying some of
the major benefits brought about by services such as e-mail
and SMS."

Spams, defined as undesired commercial e-mails, are estimated
to account for 70-80% of all e-mails [2], as everybody has
probably noticed when opening their mailbox. Such
a large number is explained by considering that the daily 
earning of a spammer is proportional to the number of 
spams sent. To reduce the nuisance caused by spams, an
enormous effort has been devoted to designing efficient spam
filters (see [3] and references therein), able to quickly
discriminate between a spam and a legitimate e-mail. Much
less effort has been devoted to the problem of understand- 
ing how spammers operate, which is the crucial informa- 
tion required to fight spammers at the source. In this 
paper, we present a statistical analysis of the spamming 
process, which may help unveil how spammers operate. 

Dataset. — Our analysis has been made possible by 
modern antispam filters, able to discriminate with good
accuracy between legitimate e-mails and spams. These
filters can be configured in such a way that spams are 
not erased, but collected in an appropriate folder: we call 
this folder the junk folder. We have considered four junk
folders, $J_1$, $J_2$, $J_3$, $J_4$, belonging to four academic e-mail
accounts of our university (domain 'na.infn.it'). The folders
are created by the antispam filter "Sophos", and contain
respectively $16 \cdot 10^4$, $27 \cdot 10^3$, $21 \cdot 10^3$, and $7 \cdot 10^3$
spams. For comparison, we have also considered one standard
inbox folder, $I$, containing $4 \cdot 10^3$ legitimate e-mails.
The popularity of the four accounts we have considered
among spammers varies, as the mean intertime between
two consecutive spams is, respectively, 300 s, 700 s, 1100 s
and 870 s. For each e-mail, we have determined
the time of arrival and the geographical location of the 
sender. The delivery time $t_i$ of an e-mail is registered by
the incoming mail server. In order to obtain an estimate
of the error on $t_i$, we set up a script to send e-mails at a
regular interval, $t_{delay}$, from an account Bob (based in
the USA) to a different account, Alice (based in Italy).
The intertime between two consecutive e-mails delivered
to Alice is not constant and equal to $t_{delay}$, but fluctuates.
The typical size of these fluctuations (which may depend
on the internet routing) is 10 s. This value is our estimate
for the error on the delivery times. The geographical
location of the sender is determined from the IP address
of the sender [5], which is recorded in the envelope that
accompanies any e-mail.

Fig. 1: (color online) Daily dependence of the probability of
receiving an e-mail. $I$ indicates the probability of receiving
a regular e-mail, which is highly structured. $J_i$ indicates the
probability of receiving a spam, which on the contrary exhibits
small oscillations during the day. The index $i = 1, 2, 3, 4$
identifies the investigated datasets.

Data analysis. — The investigation of the e-mails re- 
ceived by the incoming mail server [4], as shown in Fig. 1, 
reveals that spams and legitimate e-mails have different 
statistical features. For instance, regular e-mail traffic
has a temporal modulation which clearly reproduces human
activity (small activity during the night and at lunch
time), whereas spams appear to be insensitive to it. This is
a clear signature of the different mode of operation of the
e-mail senders: "whereas legitimate e-mail transmissions
are driven by social bilateral relationships, spam trans- 
missions are a unilateral spammer-driven action" [4]. 

As shown in Fig. 2, spams are received at a constant rate;
yet the time series of the arrival times $t_i$ is characterized by
bursts, short temporal periods during which many correlated
spams are received. Here we suggest that the origin
of these bursts lies in the use of peer-to-peer (i.e. decentralized)
networks of infected computers, known as botnets, to send spams.

In order to check whether the spamming process could
be considered a stationary Poisson process, we have
investigated the probability distribution $P(\Delta t)$ of the intertimes
between two consecutive events. For regular e-mails,
where correlations are expected and build up when replying
to previous e-mails [6,7], the probability distribution
is characterized by a crossover between two power laws,
as shown in Fig. 3a (inset). For the dataset we have
investigated the corresponding exponents are $-0.2$ at short
intertimes and $-1.7$ at large intertimes, but it has been
suggested that these exponents may depend on the particular
mailbox [8].
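
As an illustration of how an intertime distribution spanning several decades can be estimated, the following is a minimal sketch using logarithmically spaced bins. It is not the procedure used for Fig. 3, whose binning is not specified in the text; the function names `intertimes` and `log_binned_density` and the parameter `bins_per_decade` are assumptions introduced here.

```python
import math


def intertimes(arrival_times):
    """Return the intertimes between consecutive arrivals (input need not be sorted)."""
    ts = sorted(arrival_times)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


def log_binned_density(values, bins_per_decade=5):
    """Estimate a probability density on logarithmically spaced bins.

    Returns (bin_center, density) pairs sorted by bin center. Non-positive
    values cannot be placed on a logarithmic axis and are skipped.
    Hypothetical helper, not taken from the paper.
    """
    positive = [v for v in values if v > 0]
    if not positive:
        return []
    counts = {}
    for v in positive:
        k = math.floor(math.log10(v) * bins_per_decade)  # index of the bin holding v
        counts[k] = counts.get(k, 0) + 1
    total = len(positive)
    density = []
    for k in sorted(counts):
        left = 10 ** (k / bins_per_decade)
        right = 10 ** ((k + 1) / bins_per_decade)
        center = math.sqrt(left * right)  # geometric mean of the bin edges
        # Divide by the bin width so that the estimate integrates to one.
        density.append((center, counts[k] / (total * (right - left))))
    return density
```

On a log-log plot of the returned pairs, a power-law regime appears as a straight line whose slope is the exponent, which is how a crossover between two power laws would show up.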

Fig. 2: (color online) The cumulated number of delivered e-mails
$N(t)$ increases linearly in time. In order to rescale the
data of different datasets, here we plot $N(t)/N$, where $N$ is the
number of e-mails of the dataset, as a function of the rescaled
time $(t - t_1)/(t_N - t_1)$, where $t_1$ and $t_N$ are the times of
arrival of the first and of the last e-mail of the dataset. For
a stochastic process at constant rate $N(t)$ increases linearly
in time. The inset shows how the number of received spams
increases in a temporal window of 3 h, and shows the existence
of bursts.

Fig. 3a shows the spam intertime probability distributions.
We have computed these probability distributions
for each junk folder, as well as for the catalogs obtained
from each junk folder by considering only spams originating
from specific geographical locations: China (CHN),
European Union (EU) and United States (USA). These
different distributions are also shown in Fig. 3a. Contrary
to the case of regular e-mails, the intertime probability
distributions between spams show universal behaviour if
time is rescaled by the average rate of each dataset. In
fact, $P(\Delta t)$ can be expressed as

$$P(\Delta t) = \frac{1}{\tau} f\left(\frac{\Delta t}{\tau}\right), \qquad (1)$$

with $f$ a universal function and $\tau$ the mean intertime, as
shown by the data collapse obtained by plotting $\tau P(\Delta t)$
versus $\Delta t/\tau$ in Fig. 3b. The scaling function is well
described by a generalized gamma distribution,
where $a$ is a normalization constant, and $\beta = -0.25$ is
found via a regression procedure. The exponent $\beta$ can
also be determined from the power spectrum of the intertime
distributions, since if $P(\Delta t) \propto \Delta t^{\beta}$ at small $\Delta t$,
then $S(\omega) \propto |\int P(\Delta t) e^{-i\omega\Delta t} d\Delta t|^2 \propto \omega^{-2(\beta+1)}$ at large
$\omega$. As $\beta = -0.25$ we expect $S(\omega) \propto \omega^{-1.5}$, and Fig. 3(b)
(inset) shows that this is actually the case. As $P(\Delta t)$ is
not a simple exponential, the spamming process cannot be
considered a homogeneous Poisson process.
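
The rescaling behind the data collapse of Eq. 1 can be sketched as follows: dividing every intertime by the mean intertime of its dataset maps all datasets onto a common scale with unit mean. The sketch also includes a quick Poisson diagnostic that is not used in the paper and is swapped in here only for illustration: for a homogeneous Poisson process the intertimes are exponentially distributed, so their coefficient of variation equals 1. All function names are assumptions introduced here.

```python
import math
import random


def rescale_by_mean(dts):
    """Return (tau, rescaled), where tau is the mean intertime and rescaled = dt / tau."""
    tau = sum(dts) / len(dts)
    return tau, [dt / tau for dt in dts]


def coefficient_of_variation(dts):
    """Standard deviation divided by the mean (equal to 1 for exponential intertimes)."""
    n = len(dts)
    mean = sum(dts) / n
    variance = sum((dt - mean) ** 2 for dt in dts) / n
    return math.sqrt(variance) / mean


def poisson_intertimes(n, tau, seed=0):
    """Reference sample: n exponentially distributed intertimes with mean tau."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / tau) for _ in range(n)]
```

A measured coefficient of variation that differs clearly from 1 is a cheap first hint of non-Poissonian behaviour, although it says nothing about the shape of the scaling function itself.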

Fig. 3: (color online) (a) The distribution of intertimes between
two consecutive e-mails of the junk folders is well described
by a Gamma distribution. This is true both for the whole
distribution of intertimes and for the distribution of the
intertimes between spams sent from given geographical locations (the analysis
is restricted to datasets with more than 500 points). Inset:
the distribution of the intertimes between regular e-mails. (b)
Collapse of the intertime distribution of the junk folders. The
dotted line is a fit to Eq. 2, the thick dashed line is the result of
our model. Inset: the power spectrum of $P(\Delta t)$ for our largest
spam dataset.

We show now that the non-Poissonian nature of the
spamming process is due to the presence of temporal
correlations between spams. We have checked that this
is actually the case first by computing the mean value
$\langle \Delta t_{next}(\tau_0) \rangle$ of the intertimes following an intertime of
size $\tau_0$:

$$\langle \Delta t_{next}(\tau_0) \rangle = \frac{\sum_i \Delta t_{i+1}\, \delta(\Delta t_i, \tau_0)}{\sum_i \delta(\Delta t_i, \tau_0)}, \qquad (3)$$

where $\delta(\Delta t_i, \tau_0) = 1$ if $0.9\tau_0 < \Delta t_i < 1.1\tau_0$, and $0$ otherwise.

In the absence of correlations, one expects $\langle \Delta t_{next}(\tau_0) \rangle / \tau \simeq 1$.
We have determined this quantity both for the original
time series of the intertimes $\Delta t$ and for the reshuffled
time series. The latter are obtained from the original time series
by repeatedly exchanging the values of two randomly
chosen intertimes. Fig. 4a (inset) shows that the original
catalog is characterized by a positive correlation between
$\langle \Delta t_{next} \rangle / \tau$ and $\tau_0$, which implies that large intertimes are
most probably followed by large intertimes. Conversely,
for the reshuffled catalog $\langle \Delta t_{next}(\tau_0) \rangle / \tau \simeq 1$ (see Fig. 4b
(inset)), indicating the absence of correlations.

Fig. 4: (color online) (a) Conditional probability distribution
$P(\Delta t/\tau \mid \tau_0)$, for $\tau_0 \in Q_1$ and $\tau_0 \in Q_4$, of the intertimes between
spams of the dataset $J_1$. Inset: mean value of the intertime
$\Delta t_{next}/\tau$ following an intertime $\tau_0$. (b) The same quantities are
shown for the reshuffled catalog, where correlations are absent.
(c) The same quantities are shown for the synthetic catalog,
where correlations are found.
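
The conditional mean and the reshuffling test described above can be sketched as follows. This is a minimal illustration, not the authors' code: the names `mean_next_intertime` and `reshuffled` are introduced here, the 10% window follows the definition given with Eq. 3, and a full random permutation is used as a stand-in for repeatedly exchanging pairs of randomly chosen intertimes.

```python
import random


def mean_next_intertime(dts, tau0, tol=0.1):
    """Mean of dts[i + 1] over all i with (1 - tol) * tau0 < dts[i] < (1 + tol) * tau0.

    Returns None when no intertime falls inside the window.
    """
    low, high = (1.0 - tol) * tau0, (1.0 + tol) * tau0
    followers = [nxt for cur, nxt in zip(dts, dts[1:]) if low < cur < high]
    if not followers:
        return None
    return sum(followers) / len(followers)


def reshuffled(dts, seed=0):
    """Return a randomly permuted copy of dts.

    The permutation keeps the intertime distribution unchanged and destroys
    any correlation between consecutive intertimes (assumed equivalent to
    many random pairwise exchanges).
    """
    shuffled = list(dts)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Dividing the returned mean by the mean intertime of the whole series gives the normalized quantity plotted in the insets of Fig. 4; for the reshuffled series it should fluctuate around 1 for every choice of `tau0`.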

A further check of the existence of correlations is provided
by the study of the conditional probability $P(\Delta t \mid \tau_0)$,
which is the probability of having an intertime $\Delta t$ following
an intertime of size $\tau_0$. To improve the statistics,
following [9], we have determined five intertimes
$\delta t^*_0 = \min \Delta t_i < \delta t^*_1 < \delta t^*_2 < \delta t^*_3 < \delta t^*_4 = \max \Delta t_i$, and
defined four subsets $Q_k$, $k = 1, \dots, 4$, containing respectively
the intertimes enclosed between $\delta t^*_{k-1}$ and $\delta t^*_k$. The intertimes
$\delta t^*_k$, $k = 1, 2, 3$, are chosen in such a way that the
four subsets contain the same number of elements. We
have then computed $P_i(\Delta t/\tau \mid \tau_0)$, $\tau_0 \in Q_i$, $i = 1, 2, 3, 4$.
For instance, $P_1$ is the probability distribution of the
intertimes (normalized by the mean intertime) following the
smallest intertimes, whereas $P_4$ is the probability distribution
of the intertimes following the largest intertimes.
In the absence of correlations these conditional probabilities
should all coincide, being equal to the unconditional probability
distribution. We show $P_1$ and $P_4$ in Fig. 4a: a systematic
difference between the two curves is apparent,
particularly at large intertimes, where
$P_4(\Delta t/\tau \mid \tau_0) > P_1(\Delta t/\tau \mid \tau_0)$. $P_1$ and $P_4$ are also described
by Eq. 1, but the exponents which characterize their power-law
behavior at short intertimes are different, being equal
to $\beta = -0.35$ and $\beta = -0.20$, respectively. Conversely, for
the reshuffled catalog, $P_1$ and $P_4$ do coincide, as shown
in Fig. 4b. From this analysis, we can positively conclude
that there exist temporal correlations between spams.
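
The construction of the four equally populated subsets can be sketched as follows. The sketch ranks each intertime that has a successor and assigns it to a quarter of the ranking, which is assumed here to be equivalent to choosing the thresholds $\delta t^*_k$; ties between equal intertimes are broken by position in the series, a detail the text does not specify. The name `quartile_followers` is introduced for illustration.

```python
def quartile_followers(dts):
    """Group the intertimes that follow each intertime by the quartile of the preceding one.

    Returns four lists; list k holds the following intertimes, normalized by
    the mean intertime, whose preceding intertime lies in the (k + 1)-th quarter
    of the ranked values. Hypothetical helper, not taken from the paper.
    """
    tau = sum(dts) / len(dts)
    pairs = list(zip(dts, dts[1:]))  # (preceding intertime, following intertime)
    order = sorted(range(len(pairs)), key=lambda i: pairs[i][0])
    n = len(order)
    groups = [[], [], [], []]
    for rank, i in enumerate(order):
        k = 4 * rank // n  # quartile index 0..3, since rank < n
        groups[k].append(pairs[i][1] / tau)
    return groups
```

Histogramming `groups[0]` and `groups[3]` gives estimates of the two conditional distributions compared in Fig. 4a; for an uncorrelated series the two histograms should agree within statistical fluctuations.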

Theoretical model. — These results are very simi- 
lar to those found in the study of earthquakes [10] , where 
one investigates the statistical features of catalogs regis- 
tering the time of occurrence of each earthquake, as well 
as its magnitude. Similar results have also been found
in financial markets [11] and climate [12]. The study
of earthquake catalogs has shown that the distribution
of intertimes between earthquakes is described by the
same functional form we found for the intertimes between
spams [10]. Indeed, for worldwide seismicity, where the
occurrence rate is constant, Corral [10] found a Gamma
function with a power-law initial regime with an exponent
$\beta \simeq -0.3$, close to our value. The study of the conditional
probability distributions [9] also gives similar results. A
deeper understanding of seismic catalogs is however made
possible by the fact that earthquakes are characterized by
a magnitude, a quantity which has no counterpart in
the case of spams. Correlating the magnitude of an earthquake
with its occurrence time makes it possible to clarify
that seismic catalogs can be considered as the random superposition
of sequences of correlated events. Namely, if the
first event of the sequence (the mainshock) occurs at time
$t = 0$, then the probability that subsequent events (aftershocks)
occur at time $t$ is $P(t) \propto t^{-p}$, where $p \simeq 1$ (Omori
law). The sequences are identified considering that the
mainshock usually has magnitude greater than subsequent
quakes, and actually triggers them.

These considerations suggest that the spam time series
could also be considered as the random superposition of
power-law correlated bursts. It is difficult to validate this
possibility directly, as the absence of a variable which
plays the role of the magnitude makes it difficult to identify
the starting event of a burst, even though in
some cases bursts are clearly observable in the time series,
as in Fig. 2 (inset). To verify this possibility, we have constructed
a synthetic catalog composed of a superposition
of $N_s = 10^5$ sequences of $n_b$ events. The first element of a
sequence occurs at time $t_0$, followed by the others occurring
at time $t > t_0$ with probability $P(t - t_0) \propto (t - t_0)^{-p}$,
truncated at $t = t_0 + 10t^*$, with $t^*$ conventionally set equal
to 1. For each sequence, the starting time $t_0$ is randomly
chosen in the interval $[0, t^* N_s]$, and therefore the mean
intertime between the beginning of two consecutive sequences
is $t^*$. We have then determined $p$ and $n_b$ by fitting
the intertime distribution from the model to a Gamma
distribution with power-law exponent $-0.25$. The best
fit, which is shown in Fig. 3(b), is obtained with $p = 0.8$
and $n_b = 10$. Interestingly, the value of the exponent
$p$ is close to the measured value of the Omori exponent
for earthquakes. The model exhibits a constant rate, and
also reproduces the intertime distribution, as shown
in Fig. 3(b), as well as the spam correlations, as shown in
Fig. 4c. These results suggest that, like earthquakes, climate
and financial markets, spams also occur in bursts of
events correlated according to the Omori law.
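
A synthetic catalog of this kind can be generated along the following lines. This is a sketch under stated assumptions rather than the authors' implementation: it assumes that $n_b$ counts the first event of each sequence, that the delays of the remaining events are drawn independently from the truncated power law, and that $p < 1$, so that the density is normalizable on $(0, 10t^*]$ and can be sampled by inverting its cumulative distribution $F(x) = (x/T)^{1-p}$. The function names are introduced here.

```python
import random


def sample_omori_delay(rng, p, t_max):
    """Draw a delay in (0, t_max] with density proportional to delay ** (-p).

    Inverse-transform sampling; valid only for 0 <= p < 1 (assumption).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("this sampler requires 0 <= p < 1")
    u = 1.0 - rng.random()  # uniform in (0, 1]
    return t_max * u ** (1.0 / (1.0 - p))


def synthetic_catalog(n_seq, n_b, p=0.8, t_star=1.0, cutoff=10.0, seed=0):
    """Superpose n_seq sequences of n_b events each and return the sorted event times.

    Each sequence starts at a time drawn uniformly in [0, t_star * n_seq];
    the other n_b - 1 events follow the first one with power-law delays
    truncated at cutoff * t_star.
    """
    rng = random.Random(seed)
    events = []
    for _ in range(n_seq):
        t0 = rng.uniform(0.0, t_star * n_seq)
        events.append(t0)
        for _ in range(n_b - 1):
            events.append(t0 + sample_omori_delay(rng, p, cutoff * t_star))
    events.sort()
    return events
```

With `n_b = 10` and `t_star = 1` the mean intertime of the resulting catalog is close to 0.1, since ten events are generated per unit of time on average; the intertimes of the sorted list can then be analyzed exactly like those of a junk folder.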

Conclusions. — It remains open the understanding of 
the origin of the spam bursts. A step in this direction is 
given by the analysis of the IP addresses of the computers 
from which spams are sent. This shows that 1) a recipient
never receives two spams sent by the same IP within a
time window of a few hours, and that 2) the same spam is
sometimes sent by the same IP to two
different recipients. This evidence, together with the observation
that a spammer does not send all of his messages
from a single IP, both because it would be too easily discovered
and because it would require too much time due
to bandwidth limitations, strongly supports the idea that
spammers operate via networks of infected computers,
known as botnets [13]. The computational task
of sending the e-mails, as well as the bandwidth requirements,
is therefore divided among the infected computers.
At the present time, there is little information about
the structure of botnets, as previous studies have investigated
the strategies used to infect computers, as well
as the protocols used by infected computers to communicate.
Nevertheless, our result suggests that botnets are
highly dynamical networks, with new infected computers being
continuously added to replace those that disappear, either
because they are switched off or because their infection is
removed via the use of antivirus tools. Botnets may also
explain the origin of correlations between spams. Indeed,
it seems possible that each burst represents the activity
of a single spammer. The spammer simultaneously sends
several messages to the same user, who receives them at different
arrival times, the arrival time of each spam depending
on the path followed on the net. As a consequence,
the intertime distribution and the intertime correlations
may result from the topology of the botnet.